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Abstract 

For tensor decompositions such as HOSVD and 
ParaFac, the objective functions are nonconvex. This im- 
plies, theoretically, there exists a large number of local op- 
timas: starting from different starting point, the iteratively 
improved solution will converge to different local solutions. 
This non-uniqueness present a stability and reliability prob- 
lem for image compression and retrieval. In this paper, we 
present the results of a comprehensive investigation of this 
problem. We found that although all tensor decomposition 
algorithms fail to reach a unique global solution on random 
data and severely scrambled data; surprisingly however, on 
all real life several data sets ( even with substantial scramble 
and occlusions), HOSVD always produce the unique global 
solution in the parameter region suitable to practical appli- 
cations, while ParaFac produce non-unique solutions. We 
provide an eigenvalue based rule for the assessing the solu- 
tion uniqueness. 



1. Introduction 

Tensor based dimension reduction has recently been ex- 
tensively studied for computer vision, pattern recognition, 
and machine learning applications. Typically, such ap- 
proaches seek subspaces such that the information are re- 
tained while the discared subspaces contains noises. Most 
tensor decomposition methods are unsupervised which en- 
able researchers to apply them in any machine learn- 
ing applications including unsupervised learning and semi- 
supervised learning. 

Perhaps High Order Singular Value Decomposition 
(HOSVD) [?] [7] and Parallel Factors (ParaFac) are some 
of the most widely used tensor decompositions. Both of 
them could be viewed as extensions of SVD of a 2D ma- 
trix. HOSVD is used in computer vision by Vasilescu and 
Terzopoulos [?] while ParaFac is used in computer vision 
by Shashua and Levine [?]. More recently, Yang et al. [?] 



proposed a two dimensional PCA (2DPCA) Ye et al. [?] 
proposed a method called Generalized Low Rank Approxi- 
mation of Matrices (GLRAM). Both GLRAM and 2DPCA 
can be viewed in the same framework in 2DSVD (two- 
dimensional singular value decomposition) [?]. and solved 
by non-iterative algorithm [5] The error bounds of HOSVD 
have been derived [3] and the equivalence between tensor 
i-sT-means clustering and HOSVD is also established [?]. 

Although tensor decompositions are now widely used, 
many of their properties so far have not been well charac- 
terized. For example, the tensor rank problem remains a 
research issue. Counter examples exist that argue against 
optimal low-dimension approximations of a tensor. 

In this paper, we address the solution uniqueness issues. 
This problem arises because the tensor decomposition ob- 
jective functions are non-convex with respect to all the vari- 
ables and the constraints of the optimization are also non- 
convex. Standard algorithms to compute these decompo- 
sitions are iterative improvement. The non-convexity of 
the optimization implies that the iterated solutions will con- 
verge to different solutions if they start from different initial 
points. 

Note that this fundamental uniqueness issue differs from 
other representation redundancy issues, such as equivalence 
transformations (i.e. rational invariance) that change indi- 
vidual factors ([/, V, W) but leaves the reconstructed image 
untouched. These representation redundancy issues can be 
avoided if we compare different solutions at the level of re- 
constructed images, rather in the level of individual factors. 

The main findings of our investigation are both surpris- 
ing and comforting. On all real life datasets we tested (we 
tested 6 data sets and show results for 3 data set due to space 
limitation), the HOSVD solutions are unique (i.e., different 
initial starts always converge to an unique global solution); 
while the ParaFac solution are almost always not unique. 
Furthermore, even with substantial randomizations (block 
scramble, pixel scramble, occlusion) of these real datasets, 
HOSVD converge to unique solution too. 

These new findings assure us that in most applications 
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using HOSVD, the solutions are unique — the results are 
repeatable and reliable. 

We also found that whether a HOSVD solution is unique 
can be reasonably predicted by inspecting the eigenvalue 
distributions of the correlation matrices involved. Thus the 
eigenvalue distributions provide a clue about the solution 
uniqueness or global convergence. We are looking into a 
theoretical explanation of this rather robust uniquenss of 
HOSVD. 

2. Tensor Decomposition 

2.1. High Order SVD (HOSVD) 

Consider 3D tensor: X = {X^^^l . The ob- 
jective of HOSVD is to select subspace U, V„ W and core 
tensor S such that the L 2 reconstruction error is minimized, 

min Ji = \\X -U ®iV ® 2 W ®zS\\ 2 (1) 
u,v,w,s 

where U G 5ft™ 1 xmi , V G 5ft™ 2 x ™ 2 , W G 5ft™ 3 x ™ 3 , S G 
5ft" 11 x ™2 x m 3 . using explicit index, 

Jl = ^2 \ Xijk — ^ UipVjqWkrSpqr^J ■ (2) 
ijk pqr 

In HOSVD, W, U, V are required to be orthogonal: 

u T u = /, V T V = I, W T W = I. 

With the orthonormality condition, setting d J\ / dS = 0, we 
obtain S = U T ®i V T ® 2 W T ® 3 X, and J x = \\X\\ 2 - 
\\S\\ 2 . Thus HOSVD is equivalent to maximize 

max ||5|| 2 = \\U T ® x V T ® 2 W T ® 3 X\\ 2 (3) 
u,v,w 

= Tr U T FU (4) 

= Tr V T GV (5) 

= Tr W T HW. (6) 

where 

F lV = XijtXvMVV T ) j ?(WW T ) u < (7) 

jj'W 

Gjf = ^ X ij iX VJ u>(UV T ) ii >(WW T )u' (8) 

WW 

H u > = XiitXi'MUU T )ii>(yV T ) jj > (9) 

ii'jj' 

Standard HOSVD algorithm starts with initial guess of of 
(U, V, W) and solve Eqs(3,4,5) alternatively using eigen- 
vectors of the corresponding matrix. Since F, G, H are 



semi -positive definite, ||5|| 2 are monotonically increase 
(non-decrease). Thus the algorithm converges to a local op- 
timal solution. 

HOSVD is a nonconvex optimization problem: The ob- 
jective function of Eq.(2) w.r.t. (17, V, W) is nonconvex 
and the orthonormality constraints of Eq.(2) are nonconvex 
as well. It is well-known that for nonconvex optimization 
problems, there are many local optimal solutions: starting 
from different initial guess of (U, V, W), the converged so- 
lutions are different. Therefore theoretically, solutions of 
HOSVD are not unique. 

2.2. ParaFac decomposition 

ParaFac decomposition [4, 2] is the simplest and also 
most widely used decomposition model. It approximates 
the tensor as 

R R 

x^y, u(r) ® r(r) ® wW ' or X ^ ~ X! U ir V jr W kr 

r—1 r—1 

(10) 

where R is the number of factors and U = 
(u«--- ,u (fi) ), V = (v^ 1 ),.-. ,v<*>), W = 
( w (!) ; ••• ; v/( RS> ). ParaFac minimizes the objective 

"1 »2 "3 R 

^ParaFac = ^2 X X I \^V k ~ X Ui r V Jr Wkr if (1 1) 
i=l j=l k=l r=l 

We enforce the implicit constraints that columns of U = 
(u^\ ■ ■ ■ , u( R ^) are linearly independent; columns of V = 
(v^ 1 ', ■ ■ • , v( R ') are linearly independent; and columns of 
W = (w' 1 ' , ■ ■ • , are linearly independent. 

Clearly the ParaFac objective function is nonconvex in 
(U,V,W). The linearly independent constraints are also 
nonconvex. Therefore, the ParaFac optimization is a non- 
convex optimization. 

Many different computational algorithms were devel- 
oped for computing ParaFac. One type of algorithm uses 
a sequence of rank-1 approximations [10, 6, ?]. However, 
the solution of this heuristic approach differ from (local) 
optimal solutions. 

The standard algorithm is to compute one factor at a time 
in an alternating fashion. The objective decrease mono- 
tonically in each step, and the iteration converges to a (lo- 
cal) optimal solution. However, due to the nonconvexity of 
ParaFac optimization, the converged solution depends heav- 
ily on the initial starting point. For this reason, the ParaFac 
is often not unique. 

3. Unique Solution 

In this paper, we investigate the problem of whether the 
solution of a tensor decomposition is unique. This is an 



important problem, because if the solutions is not unique, 
then the results are not repeatable and the image retrieval is 
not reliable. 

For a convex optimization problem, there is only one lo- 
cal optimal solution which is also the global optimal so- 
lution. For a non-convex optimization problem, there are 
many (often infinite) local optimal solutions: converged so- 
lutions of the HOS VD/ParaFac iterations depend on the ini- 
tial starting point. 

In this paper, we take the experimental approach. For a 
tensor decomposition we run many runs with dramatically 
different starting points. If the solutions of all these runs 
agree with each other (to computer machine precision), then 
we consider the decomposition has a unique solution. 

In the following, we explain the (1) The dramatically 
different starting point for (U, V, W). (2) Experiments on 
three different real life data sets. (3) Eigenvalue distribu- 
tions which can predict the uniques of the HOSVD. 

4. A natural starting point for W: the Tl 

decomposition and the PCA solution 

In this section, we describe a natural starting point for 
W. Consider the Tl decomposition [?] 

ma rii3 
X ijk « J2 C ^W k k> or x\f « J2 C^' } Wkk'. (12) 

k' = l k' = l 

C, W are obtained as the results of the optimization 

B3 rn 3 

min J T1 =£ll* (fc) -£ C(r WH 2 - d3) 

k=l k' = l 

This decomposition can be reformulated as the following: 

J Tl = \\X\\ 2 -7v{W T HW), (14) 
where 

H kk , = Tr (X« [X (k 'Y) = ]T X ijk X ijk , . (15) 

y 

Cis given by C< r » = YlU x(k) W kr . 

This solution is also the PCA solution. The reason is 
the following. Let A = (oi, • • • , a n ) be a collection of ID 
vectors. The corresponding covariance matrix is AA T and 
Gram matrix is A T A. Eigenvectors of A T A are the prin- 
cipal components. Coming back to the Tl decomposition, 
H is the Gram matrix if we consider each image X^ as a 
ID vector. Solution for W are principal eigenvectors of H, 
which are the principal components. 

5. Initialization 

For both HOSVD and ParaFac, we generate 7 different 
initializations: 



(Rl) Use the PCA results W as explained in §4. Set V to 
identity matrix (fill zeros in the rest of the matrix to fit the 
size of tt-2 x mi). This is our standard initialization. 

(R2) Generate 3 full-rank matrixes W and V with uni- 
form random numbers of in (0, 1). 

(R3) Randomly generate 3 rank deficient matrices W 
and V with proper size. For first initialization, we randomly 
pick a column of W and set the column to zero. The rest of 
columns are randomly generated as in (R2) and the same for 
V. For second and third initializations, we randomly pick 
two or three columns of W and set them to zero, and so on. 
Typically, we use m\ = 777,2 = m% = 5 ~ 10. Thus the 
rank-deficiency at 7713 = 5 is strong. 

We use the tensor toolbox [ ]. The order of update in the 
alternating updating algorithm is the following: (1) Given 
(V,W), solve for U (to solve Problem 4); (2) Given (U,W), 
solve for V (Problem 5); (3) Given (U, V), solve for W 
(Problem 6); Go back to (1) and so on. 

6. Run statistics and validation 

For each dataset with each parameter setting, we run 10 
independent tests. For each test, we run HOSVD iterations 
to convergences (because of the difficulty of estimating con- 
vergence criterion, we run total of T=100 iterations of alter- 
nating updating which is usually sufficient to converge). 

For each independent test, we have 7 different solutions 
of (Ui, Vi, Wi) where i = 1, 2, ■ • ■ ,7 for the solution start- 
ing from the 7-th initialization. We use the following differ- 
ence to verify whether the solutions are unique: 

d(t) = I J2 (W U i ~ U *W + W V i - V ? II + W W i ~ Wt ID . 

i=1 

where we introduce the HOSVD iteration index t, and 
U\, V*, W\ are the solution in £-th iteration. 

If an optimization problem has a unique solution, d(t) 
typically starts with nonzero value and gradually decrease 
to zero. Indeed, this occurs often in Figure 2 The sooner 
d(t) decreases to zero, the faster the algorithm converges. 
For example, in the 7th row of Figure 2, the mi = m^ = 
m 3 = 5 parameter setting, the algorithm converges faster 
than the m\ = 777,2 = rri3 = 10 setting. 

In our experiments, we do 10 different tests (each with 
different random starts). If in all 10 tests d(t) decreases to 
zero, we say the optimization has a unique solution (we say 
they are globally convergent). 

If an optimization has no unique solution (i.e., it has 
many local optima), d(t) typically remains nonzero at all 
times, we say the solution of HOSVD is not unique. In 
Figure 1, we show the results of HOSVD and ParaFac on 
a random tensor. One can see that in each of the 10 tests, 
shown as 10 lines in the figure, none of them ever decrease 
to zero. 



For ParaFac we use the difference of reconstructed tensor 
to evaluate the uniqueness of the solution: 

d '(*) = ^En^-^iii. ( 16 > 

where X\ is the reconstruction tensor in the <-th iteration 
with the i-th starting point. ParaFac algorithm converge 
slower than HOSVD algorithm. Thus we run 2000 itera- 
tions for each test. 

7. Eigenvalue Distributions 

In these figures, the eigenvalues of F, G, and 
H are calculated using Eqs.(7,8,9), but setting all 
UU T , VV T , WW T as identity matrix. The matrices are 
centered in all indexes. The eigenvalues are sorted and nor- 
malized by the sum of the all the eigenvalues. 

For WANG dataset, we also show the result of mi = 
7712 = 2, ui3 = 4 and mi = m-i = ni-j = 3 for 80% pixel 
scramble in the last row of the top part of Figure 3. For 
101 dataset, we add results mi = 7712 = ms = 30 and 
mi = 1712 = ni3 = 80 int the last row of Figure 4. 

8. Datasets 

The first benchmark is face databases AT&T [?] in which 
there are ten different images of each of 40 distinct subjects. 
We use the original size of the image. All 400 images form 
a 112 x 92 x 400 tensor. 

The second image dataset is WANG [ ] which contains 
10 categories (Africa, Bench, Buildings, Buses, Dinosaurs, 
Elephants, Flowers, Houses, Mountains, and Food) and 100 
images for each category. The original size of the image is 
either 384 x 256 or 256 x 384. We select Buildings, Buses, 
and Food categories and resize the images into a 100 x 100 
size. We also transform all images into 0-255 level gray 
images. The selected images form a 100 x 100 x 300 tensor. 

The third dataset is Caltech 101 [S] which contains 101 
categories. About 40 to 800 images per category. Most cat- 
egories have about 50 images. Collected in September 2003 
by Li, Andreetto, and Ranzato. The size of each image is 
roughly 300 x 200 pixels. We randomly pickup 200 images, 
resize and transform them into 100 x 100 0-255 level gray 
images to form a 100 x 100 x 200 tensor. 

9. Image randomization 

Three types randomization are considered: block scram- 
ble, pixel scramble and occlusion. In block scramble, an 
image is divided into n = 2, 4, 8 blocks; blocks are scram- 
bled to form new images (see Figure 2, 3 and 4). 

In pixel sample, we randomly pick up a = 
40%, 60%, 80% of the pixels in the image, and randomly 



scramble them to form a new image (see Figure 2, 3, and 
4). 

We also experimented with occulsions with sizes upto 
half of the images. We found that occulsion consistently 
produce smaller randomization affects and HOSVD results 
converge to the unique solution. For this reason and the 
space limitation, we do not show the results here. 

10. Main Results 

From results shown in Figures 2, 3, and 4. we observe 
the following: 

1 . For all tested real-life data, ParaFac solutions are not 
unique, i.e., the converged solution depends on initial 
starts. This is consistent with the non-convex opti- 
mization as explained in §2.2. 

2. For all tested real-life data, HOSVD solutions are 
unique, although theoretically, this is not guarrentteed 
since the optimization of HOSVD is non-convex as ex- 
plained in §2.1; 

3. For even heavily rescrambled (randomized) real-life 
data, HOSVD solutions are also unique; This is sur- 
prsing, given that the HOSVD optimization are non- 
convex. 

4. For very severelly rescrambled real-life data and pure 
randomly generated data, HOSVD solutions are not 
unique. 

5. The HOSVD solution for a given dataset may be 
unique for some parameter setting but non-unique for 
some other parameter setting. 

6. Whether the HOSVD solution for a given dataset will 
be unique can largely be predicted by inspecting the 
eigenvalue distribution of the matrices F, G, H. See 
next section. 

11. Eigenvalue-base uniqueness prediction 

We found Empirically that the eigenvalue distribution 
help to predict whether the HOSVD solution on a dataset 
with a parameter setting is unique or not. 

For example, in AT&T dataset HOSVD converges in 
all parameter settings except in 8 x 8 block scramble with 
mi = ma = m 3 = 5. This is because the ignored 3 eigen- 
modes have very similar eigenvalues as the first five eigen- 
values. It is ambiguous for HOSVD to select which of the 8 
significant eigenmodes. Thus HOSVD fails to converge to 
a unique solution. 

But when we increase mi, 7712,7713 to 10, all 8 signif- 
icant eigenmodes can be selected and HOSVD converges 
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Figure 1. HOSVD and ParaFace Convergence on a 100 x 100 x 100 
random tensor. 

to a unique solution. This also happens in the other two 
datasets (see the forth rows in top part of Figures 3 and 4. 

For 80% pixel scramble in dataset WANG, when mi = 
rrc.2 = TO3 = 5, 10, HOSVD is ambiguous as to select 
eigenmodes because there are a large number of them with 
nearly identical eigenvalues around the cutoff. However, 
if we reduce the dimensions to mi = ni2 = 2, m% = 4 
or mi = m 2 = m 3 = 3, this ambiguity is gone: HOSVD 
clearly selects the top 2 or 3 eigenmodes. converges (see the 
last row of the top panel in Figure 3). This same observation 
also applies to Caltech 101 dataset at 80% pixel scramble in 
101 (see the last row of the top part of Figure 4). 

For random tensor shown in Figure 1 , the eigenvalues 
are nearly identical to each other. Thus for both parameter 
setting (mi = mo, = ms = 5 and mi = mi = ma = 10), 
HOSVD is ambiguous to selection eigenmodes and thus 
does not converge. 

We have also investigated the solution uniqueness prob- 
lem of the GLRAM tensor decomposition. The results are 
very close to HOSVD. We skip it due to space limitation. 

12. Summary 

In summary, for all real life datasets we tested, the 
HOSVD solution are unique (i.e., different initial starts al- 
ways converge to an unique global solution); while the 
ParaFac solution are almost always not unique. These find- 
ing are new (to the best of our knowledge). They also sur- 
prising and comforting. We can be assured that in most ap- 
plications using HOSVD, the solutions are unique — the re- 
sults are reliable and repeatable. In the rare cases where the 
data are highly irregular or severely distored/randomized, 
our results indicate that we can predict whether HOSVD so- 
lution is unique by inspecting the eigenvalue distributions. 
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Figure 2. AT&T dataset (400 images of size 112 x 92 each). Shown are eigenvalues of F, G, H, and solution uniqueness of HOSVD and 
ParaFac. 
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Figure 3. WANG dataset (300 images, 100 x 100 size for each). Shown are eigenvalues of F, G, H, and solution uniqueness of HOSVD 
and ParaFac. 
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Figure 4. Caltech 101 dataset (200 images of size 100 x 100 each). Shown are eigenvalues of F, G, H, and solution uniqueness of 
HOSVD and ParaFac. 
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